library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
##
## The following object is masked from 'package:dplyr':
##
## collapse
##
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
cyto <- read_csv('data/HOT_cyto_counts_edit.csv') %>%
select(!'...1')
## New names:
## Rows: 1217 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (10): ...1, botid, date, press, chl, hbact, pro, syn, picoeuk, cruise
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
mlds <- read_csv('data/hot_mlds.csv')
## Rows: 160 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): crn, date, julian, n, mean, stdDev, stdErr
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We will focus only on pro, hbact, and picoeuk, and explore how the niches of these three groups differ from one another.
“botid” (bottle ID), “date” (date, in mmddyy format), “press” (pressure in decibars) “chl” (fluorometric chlorophyll a concentration, in micrograms L-1), “hbact” (heterotrophic bacteria concentration, in 105 cells mL-1), “pro” (Prochlorococcus concentration, in 105 cells mL-1), “syn” (Synechococcus concentration, in 105 cells mL-1), “picoeuk” (photosynthetic picoeukaryote concentration, in 105 cells mL-1), and “cruise” (cruise ID). Note that pressure in decibars is approximately equal to depth in meters (i.e., depth of the sampling device is measured using a pressure sensor).
#1.
To create a day of the year predictor you will need to first convert the ‘date’ column to date format, and then make a new column that extracts the day of the year from the date column. There are helpful functions in the package ‘lubridate’ that will do these steps for you.
# convert date column to date format - then extract day of year from date column
cyto <- cyto %>%
mutate(date = mdy(date),
day = yday(date)
)
When fitting the 2D smoother for each type of microbe, consider how the response variable should be modeled (transformed or not, normal or non-normal). You can see the probability distributions available for the gam() function in package mgcv by looking at the help file titled ‘family.mgcv’. - assess normality - assess potential leverage of outlier points
# look at dist for each of three response variables
hist(cyto$pro) # bimodal? square root transform - possible high leverage to right
hist(sqrt(cyto$pro)) # this still isn't normal, dist appears more even now tho
hist(log(cyto$pro)) # heavily skewed, don't use
hist(cyto$hbact) # normal - all good
hist(cyto$picoeuk) # Poisson? square root transform
hist(sqrt(cyto$picoeuk)) # now appears normal, looks good!
incorporate response variable transformations
# square root transform pro
cyto$sqrt_pro <- sqrt(cyto$pro)
# square root transform picoeuk
cyto$sqrt_picoeuk <- sqrt(cyto$picoeuk)
For each of the three groups of microbes (pro, hbact, and picoeuk), fit a 2D smoother that characterizes how abundance changes with depth and with day of the year (i.e., from day 1 to day 365 or 366).
sqrt_pro_gam
# sqrt_pro gam
gam_sqrt_pro = gam(sqrt_pro ~ s(press, day), data = cyto) # smoothing off press, day
gam.check(gam_sqrt_pro) # residuals look perfect, k = 29
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 5.08649e-07 .
## The Hessian was positive definite.
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press,day) 29.0 27.4 1.07 1
summary(gam_sqrt_pro) # edf 27.4 - 90% dev explained
##
## Family: gaussian
## Link function: identity
##
## Formula:
## sqrt_pro ~ s(press, day)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.037586 0.004633 223.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press,day) 27.39 28.84 375.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.899 Deviance explained = 90.1%
## GCV = 0.026751 Scale est. = 0.026127 n = 1217
plot(gam_sqrt_pro) # not sure how to interpret these
# large peak in sqrt_pro around press of 70 and from roughly March to Oct?
hbact gam
# hbact gam
gam_hbact = gam(hbact ~ s(press, day), data = cyto) # smoothing off press, day
gam.check(gam_hbact) # residuals look great, k = 29
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 6 iterations.
## The RMS GCV score gradient at convergence was 2.491103e-06 .
## The Hessian was positive definite.
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press,day) 29.0 22.8 0.96 0.065 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_hbact) # edf 22.9 - 74% dev explained
##
## Family: gaussian
## Link function: identity
##
## Formula:
## hbact ~ s(press, day)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88022 0.01811 214.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press,day) 22.85 26.88 123.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.732 Deviance explained = 73.7%
## GCV = 0.40709 Scale est. = 0.39911 n = 1217
plot(gam_hbact) # small peak for hbact press of 30, in roughly Sep
sqrt_picoeuk_gam
# sqrt_picoeuk gam
gam_sqrt_picoeuk = gam(sqrt_picoeuk ~ s(press, day), data = cyto) # smoothing off press, day
gam.check(gam_sqrt_picoeuk) # residuals look good, k = 29
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradient at convergence was 6.964021e-07 .
## The Hessian was positive definite.
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press,day) 29.0 24.3 1.03 0.82
summary(gam_sqrt_picoeuk) # edf 24.3 - only 52% dev explained
##
## Family: gaussian
## Link function: identity
##
## Formula:
## sqrt_picoeuk ~ s(press, day)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0933431 0.0006991 133.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press,day) 24.3 27.72 45.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.514 Deviance explained = 52.3%
## GCV = 0.00060737 Scale est. = 0.00059475 n = 1217
plot(gam_sqrt_picoeuk) # not sure how to interpret these
# several small peaks in sqrt_picoeuk - deep early, shallow mid, deep late in year
Consider whether the basis dimension needs to be increased beyond the default value. - default edf values: sqrt_pro: 27.4, hbact: 22.9, sqrt_picoeuk: 24.3 - default k values for all: 29 - double k for each gam
double k for sqrt_pro
gam_sqrt_pro_k = gam(sqrt_pro ~ s(press, day, k = 60), data = cyto)
gam.check(gam_sqrt_pro_k) # residuals look good, k = 59 - diff than default
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradient at convergence was 7.900007e-07 .
## The Hessian was positive definite.
## Model rank = 60 / 60
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press,day) 59.0 38.7 1.08 1
summary(gam_sqrt_pro_k) # edf 38.7 - 90% dev explained
##
## Family: gaussian
## Link function: identity
##
## Formula:
## sqrt_pro ~ s(press, day, k = 60)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.037586 0.004635 223.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press,day) 38.66 48.7 221.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.899 Deviance explained = 90.2%
## GCV = 0.027022 Scale est. = 0.026141 n = 1217
plot(gam_sqrt_pro_k)
note that doubling the k value for sqrt_pro gam produces different result - suggests that default settings did not allow for enough complexity in the model - although the plot appears identical?
double k for hbact
gam_hbact_k = gam(hbact ~ s(press, day, k = 60), data = cyto)
gam.check(gam_hbact_k) # residuals look great, k = 59
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 5.820414e-13 .
## The Hessian was positive definite.
## Model rank = 60 / 60
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press,day) 59.0 27.8 0.97 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_hbact_k) # edf 27.8 - 74% dev explained
##
## Family: gaussian
## Link function: identity
##
## Formula:
## hbact ~ s(press, day, k = 60)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8802 0.0181 214.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press,day) 27.83 37.42 88.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.732 Deviance explained = 73.8%
## GCV = 0.40845 Scale est. = 0.39877 n = 1217
plot(gam_hbact_k) # small peak for hbact press of 30, in roughly Sep
same as above, doubling k changes the edf value suggesting default settings didn’t adequately account for necessary complexity in the model - but the plot is pretty much identical
double k for sqrt_picoeuk
gam_sqrt_picoeuk_k = gam(sqrt_picoeuk ~ s(press, day, k = 60), data = cyto)
gam.check(gam_sqrt_picoeuk_k) # residuals look good, k = 59 - diff than default
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 1.08866e-07 .
## The Hessian was positive definite.
## Model rank = 60 / 60
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press,day) 59.0 37.7 1.07 1
summary(gam_sqrt_picoeuk_k) # edf 37.7 - 54% dev explained - basically the same
##
## Family: gaussian
## Link function: identity
##
## Formula:
## sqrt_picoeuk ~ s(press, day, k = 60)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0933431 0.0006909 135.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press,day) 37.72 47.84 27.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.525 Deviance explained = 54%
## GCV = 0.00059994 Scale est. = 0.00058085 n = 1217
plot(gam_sqrt_picoeuk_k)
in this case it doesn’t appear that doubling k did much for the model - the final edf value was basically equal to the default settings - and again, plot appears almost identical to default settings
Plot the fitted smoother in a way that is visually appealing.
plot(gam_sqrt_pro, select = 1, scheme = 2, lwd = 2)
plot(gam_hbact, select = 1, scheme = 2, lwd = 2)
plot(gam_sqrt_picoeuk, select = 1, scheme = 2, lwd = 2)
Finally, figure out how to test whether the relationship between abundance and depth changes over time or not. - already have a 2D smoother which models the interaction between two vars - so now create 1D smoother which would not incorporate interaction - then compare R2 of these two models for each response var
sqrt_pro - looks constant throughout the year, intermediate depth peak - 2D model fits marginally better
# sqrt_pro 1D gam
gam_sqrt_pro_1 = gam(sqrt_pro ~ s(press) + s(day), data = cyto)
gam.check(gam_sqrt_pro_1) # residuals look good, k = 9
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 2.248537e-07 .
## The Hessian was positive definite.
## Model rank = 19 / 19
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press) 9.00 5.98 1.03 0.86
## s(day) 9.00 5.59 0.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_sqrt_pro_1) # 89% dev explained
##
## Family: gaussian
## Link function: identity
##
## Formula:
## sqrt_pro ~ s(press) + s(day)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.037586 0.004936 210.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press) 5.977 6.773 1373.56 <2e-16 ***
## s(day) 5.588 6.727 15.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.885 Deviance explained = 88.7%
## GCV = 0.029956 Scale est. = 0.029646 n = 1217
plot(gam_sqrt_pro_1)
# for sqrt_pro, doesn't look like day has a large impact
# but definitely see peak of sqrt_pro at intermediate depth, then drop off
hbact - see two peaks throughout the year, intermediate depth peak - 2D model fits marginally better
# hbact 1D gam
gam_hbact_1 = gam(hbact ~ s(press) + s(day), data = cyto)
gam.check(gam_hbact_1) # residuals look great, k = 9
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 2.191183e-07 .
## The Hessian was positive definite.
## Model rank = 19 / 19
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press) 9.00 4.51 0.99 0.33
## s(day) 9.00 7.79 0.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_hbact_1) # 71% dev explained
##
## Family: gaussian
## Link function: identity
##
## Formula:
## hbact ~ s(press) + s(day)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88022 0.01892 205.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press) 4.511 5.437 518.7 <2e-16 ***
## s(day) 7.791 8.617 14.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.707 Deviance explained = 71%
## GCV = 0.44049 Scale est. = 0.43568 n = 1217
plot(gam_hbact_1)
# two peaks for day
# intermediate depth peak
sqrt_picoeuk - sees much deeper peak than other species, lots of var throughout year - again, 2D model fits marginally better, although neither fit well compared to other sp.
# sqrt_picoeuk 1D gam
gam_sqrt_picoeuk_1 = gam(sqrt_picoeuk ~ s(press) + s(day), data = cyto)
gam.check(gam_sqrt_picoeuk_1) # residuals look good, k = 9
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 8.951954e-08 .
## The Hessian was positive definite.
## Model rank = 19 / 19
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press) 9.00 6.36 0.99 0.3
## s(day) 9.00 8.53 0.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_sqrt_picoeuk_1) # only 48% dev explained
##
## Family: gaussian
## Link function: identity
##
## Formula:
## sqrt_picoeuk ~ s(press) + s(day)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.093343 0.000728 128.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press) 6.359 7.045 137.31 <2e-16 ***
## s(day) 8.527 8.935 14.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.473 Deviance explained = 47.9%
## GCV = 0.00065359 Scale est. = 0.00064506 n = 1217
plot(gam_sqrt_picoeuk_1)
# see peak much deeper than other species
# lots of small spikes throughout the year
What are your interpretations of the results so far? - Different species have different abundance patterns across depth and time - pro and hbact peak at intermediate depths, around 75 (m?), picoeuk peaks deeper, at around 100 (m?) - pro appears relatively constant throughout the year, hbact has two distinct abundance peaks throughout the year (~May and September), picoeuk has four abundance peaks throughout the year.
#2.
To fit GAM(s) including all three groups simultaneously you will need to convert the data to ‘long’ format, where there is a column that contains all the concentrations of all three types of microbes, and a second column that codes which microbe was counted in that row, as well as additional columns for the other model predictors. You can convert to long format by hand using a spreadsheet, or you can use a helpful function called pivot_longer() in the package ‘tidyr’.
cyto_long <- cyto %>%
pivot_longer(
cols = matches("sqrt_pro|hbact|sqrt_picoeuk"),
names_to = "Microbe",
values_to = "Abundance"
)
cyto_long$Microbe <- as.factor(cyto_long$Microbe)
Now let’s compare the niches of the three groups to each other. Use a GAM including all groups simultaneously to simultaneously test three questions: (a) Do the different kinds of microbes have different mean abundances? - Abundance ~ Microbe, not using s b/c not continuous (b) Do the different kinds of microbes have different average depth distributions (i.e., averaging over time)? - s(Press, by = Microbe) (c) Do the different kinds of microbes have different average seasonal dynamics (i.e., averaging over depths)? - s(Day, by = Microbe)
super_model <- gam(Abundance ~ Microbe +
s(press, by = Microbe) +
s(day, by = Microbe),
data = cyto_long)
gam.check(super_model) # residuals do not look good - uh oh!
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 9 iterations.
## The RMS GCV score gradient at convergence was 5.038129e-08 .
## The Hessian was positive definite.
## Model rank = 57 / 57
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press):Microbehbact 9.00 8.72 0.89 <2e-16 ***
## s(press):Microbesqrt_picoeuk 9.00 1.54 0.89 <2e-16 ***
## s(press):Microbesqrt_pro 9.00 4.92 0.89 <2e-16 ***
## s(day):Microbehbact 9.00 8.64 0.89 <2e-16 ***
## s(day):Microbesqrt_picoeuk 9.00 1.00 0.89 <2e-16 ***
## s(day):Microbesqrt_pro 9.00 2.46 0.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# says k may be too low, b/c close to edf
summary(super_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Abundance ~ Microbe + s(press, by = Microbe) + s(day, by = Microbe)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88022 0.01126 344.5 <2e-16 ***
## Microbesqrt_picoeuk -3.78688 0.01593 -237.7 <2e-16 ***
## Microbesqrt_pro -2.84263 0.01593 -178.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press):Microbehbact 8.718 8.960 892.621 <2e-16 ***
## s(press):Microbesqrt_picoeuk 1.540 1.896 0.840 0.3440
## s(press):Microbesqrt_pro 4.916 5.853 304.036 <2e-16 ***
## s(day):Microbehbact 8.641 8.962 40.918 <2e-16 ***
## s(day):Microbesqrt_picoeuk 1.000 1.000 0.248 0.6184
## s(day):Microbesqrt_pro 2.460 3.055 5.034 0.0017 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.951 Deviance explained = 95.2%
## GCV = 0.15571 Scale est. = 0.15442 n = 3651
# re-run with diff k value, set at 30, tried 20 but k still close to edf
super_model <- gam(Abundance ~ Microbe +
s(press, by = Microbe, k = 30) +
s(day, by = Microbe, k = 30),
data = cyto_long)
gam.check(super_model) # still don't like residuals
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 18 iterations.
## The RMS GCV score gradient at convergence was 9.285798e-08 .
## The Hessian was positive definite.
## Model rank = 177 / 177
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press):Microbehbact 29.00 23.07 0.89 <2e-16 ***
## s(press):Microbesqrt_picoeuk 29.00 1.59 0.89 <2e-16 ***
## s(press):Microbesqrt_pro 29.00 4.98 0.89 <2e-16 ***
## s(day):Microbehbact 29.00 28.70 0.90 <2e-16 ***
## s(day):Microbesqrt_picoeuk 29.00 1.00 0.90 <2e-16 ***
## s(day):Microbesqrt_pro 29.00 2.51 0.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(super_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Abundance ~ Microbe + s(press, by = Microbe, k = 30) + s(day,
## by = Microbe, k = 30)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88022 0.01085 357.6 <2e-16 ***
## Microbesqrt_picoeuk -3.78688 0.01534 -246.8 <2e-16 ***
## Microbesqrt_pro -2.84263 0.01534 -185.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press):Microbehbact 23.068 26.193 330.669 < 2e-16 ***
## s(press):Microbesqrt_picoeuk 1.589 1.964 1.042 0.302888
## s(press):Microbesqrt_pro 4.978 5.937 323.136 < 2e-16 ***
## s(day):Microbehbact 28.704 28.991 21.362 < 2e-16 ***
## s(day):Microbesqrt_picoeuk 1.000 1.000 0.268 0.604983
## s(day):Microbesqrt_pro 2.511 3.126 5.325 0.000998 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.955 Deviance explained = 95.6%
## GCV = 0.14586 Scale est. = 0.14327 n = 3651
# sig differences in abundance across three different microbe types (hbact highest)
# hbact, sqrt_pro vary by depth, not sqrt_picoeuk
# hbact, sqrt_pro vary by day, not sqrt_picoeuk
Now that you have fit models to test questions (a)-(c), make appropriate plots and perform appropriate hypothesis tests.
plot(super_model)
How do you intepret the results? - sig differences in abundance across three different microbe types (hbact highest) - hbact, sqrt_pro vary by depth, not sqrt_picoeuk - hbact, sqrt_pro vary by day, not sqrt_picoeuk
looking at plots: - lots of variation across depth for hbact - no variation across depth for picoeuk - pro decreases as depth increases - lots of variation across day for hbact - no variation across day for picoeuk - effectively no variation across day for pro
#3.
Finally, let’s investigate how abundances of the three groups at shallower depths correlate with mixed layer depth (an index of stratification) and chlorophyll a concentration. The second attached file, hot_mlds.csv, contains the average mixed layer depth for each HOT cruise. You’ll need to merge the information in this file with the dataset you have been analyzing. - run join_by on key - I guess crn is cruise? No consistent key - poor data management practices!
# rename crn to cruise
mlds$cruise <- mlds$crn
joined_df <- cyto_long %>%
left_join(mlds, by = 'cruise')
Using only data from the top 45 meters
filter to only retain top 45 m
joined_df <- joined_df %>%
filter(press <= 45)
hist(joined_df$press)
how does the concentration of each group of microbes vary with (a) mixed layer depth and (b) Chl a concentration? Test whether the three types of microbes exhibit different relationships with the two predictors. - s(mean, by = Microbe) + s(chl, by = Microbe)
compare to model with 1D smoothers - s(mean) + s(chl)
Use appropriate GAM(s), hypothesis tests, and smoother plots to assess these questions.
mlds_2d model
mlds_2d <- gam(Abundance ~ Microbe + s(mean, by = Microbe) +
s(chl, by = Microbe),
data = joined_df)
gam.check(mlds_2d) # don't like residuals at all
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 10 iterations.
## The RMS GCV score gradient at convergence was 2.852528e-08 .
## The Hessian was positive definite.
## Model rank = 57 / 57
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(mean):Microbehbact 9.00 8.61 0.93 0.005 **
## s(mean):Microbesqrt_picoeuk 9.00 1.00 0.93 0.010 **
## s(mean):Microbesqrt_pro 9.00 1.00 0.93 0.010 **
## s(chl):Microbehbact 9.00 4.99 0.93 0.020 *
## s(chl):Microbesqrt_picoeuk 9.00 1.00 0.93 <2e-16 ***
## s(chl):Microbesqrt_pro 9.00 1.36 0.93 0.010 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mlds_2d) # explains 97% deviation
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Abundance ~ Microbe + s(mean, by = Microbe) + s(chl, by = Microbe)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.78352 0.01998 239.4 <2e-16 ***
## Microbesqrt_picoeuk -4.68602 0.02826 -165.8 <2e-16 ***
## Microbesqrt_pro -3.36964 0.02826 -119.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(mean):Microbehbact 8.613 8.955 28.141 < 2e-16 ***
## s(mean):Microbesqrt_picoeuk 1.000 1.000 0.014 0.906
## s(mean):Microbesqrt_pro 1.000 1.000 1.005 0.316
## s(chl):Microbehbact 4.987 6.079 6.628 1.41e-06 ***
## s(chl):Microbesqrt_picoeuk 1.000 1.000 0.014 0.905
## s(chl):Microbesqrt_pro 1.363 1.647 0.755 0.342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.965 Deviance explained = 96.5%
## GCV = 0.14658 Scale est. = 0.14373 n = 1080
plot(mlds_2d)
# hbact sig variation by mean and chl, not for the others
mlds_1d model
mlds_1d <- gam(Abundance ~ s(mean) + s(chl),
data = joined_df)
gam.check(mlds_1d) # residuals look horrible
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 11 iterations.
## The RMS GCV score gradient at convergence was 2.901478e-07 .
## The Hessian was positive definite.
## Model rank = 19 / 19
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(mean) 9 1 1.48 1
## s(chl) 9 1 1.48 1
summary(mlds_1d)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Abundance ~ s(mean) + s(chl)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.09830 0.06146 34.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(mean) 1 1 1.971 0.161
## s(chl) 1 1 0.351 0.554
##
## R-sq.(adj) = 8.07e-06 Deviance explained = 0.186%
## GCV = 4.0913 Scale est. = 4.0799 n = 1080
plot(mlds_1d)
# no sig relationships between abundance across species and depth, chlorophyll
from 2d model - explains 97% variation - hbact varies by mixed layer depth (several peaks) and chlorophyll a concentration (increases) - neither sqrt_pro or sqrt_picoeuk vary by either depth or chlorophyll a
from 1d model - wow! only explains .19% of variation - this model fits horribly - don’t find any relationship between abundance and either depth or chlorophyll
So yes, the concentration varies of different microbe types varies differently according to depth and chlorophyll concentration - this is driven by hbact, which significantly varies by both variables - the model accounting for differences across microbe types was amazing (~98% R2), whereas the model that didn’t allow for different microbes to respond differently was horrible (<1% R2)